Sarah Van Oss
Department of Anthropology, School of Liberal Arts, Tulane University
Pulling It TogetheR: Collecting and Collating Quantitative Data from Written
Reports for Analysis using R and ChatGPT
ACKNOWLEDGEMENTS
Thank you to the Proyecto Arqueológico Waka’ for their continued
support and facilitation of this research; the Middle American
Research Institute at Tulane University and Dr. Marcello Canuto for
continued research support.
GOAL: DEVELOP CODING IN R TO
COLLECT DATA FROM TEXTS
Create an efficient system for data
collection and organization
Reduce data collection time
Facilitate collation and comparison of
data across investigations, time, space
Lay groundwork for answering questions
using large datasets, relational databases
PROBLEM: RETRIEVING DATA FROM
ARCHIVAL OR TEXT DOCUMENTS
Accessing Data from Archival or Text Documents
Much archaeological data is housed in written reports.
Published reports, such as informes used throughout
Mesoamerican archaeology, communicate findings to
governing bodies and the public. Accessing this data for
later analysis, however, can be time-consuming and
difficult.
Collecting Data Manually From Written Documents
When done manually, data collection from written texts
is labor-intensive and inconsistent, often taking days to weeks. This is
particularly true when the investigator is unfamiliar with
the language, concepts, or project.
Creating Comparable, Large Datasets
Data housed in such reports are difficult to compare with
other datasets due to medium and formatting. Comparing
data from different years, investigators, projects, sites,
etc. must be done manually.
Answering Big Questions With Big Data
Research questions in archaeology are limited by the
availability of data. This includes accessibility of the data
record (e.g. written reports) and what data is feasible to
produce through traditional excavation. If data from
written reports can be accessed and understood quickly,
it will lead to better formulated research questions, more
efficient excavations, and the opportunity to expand
research scope beyond a single investigator’s capacity.
METHOD
1. Compile project report texts into a readable file
I use Excel to create separate observations, or data
entries, for analysis by the R code. To segment written
archaeological data, each observation equates to one
lot, the smallest unit of excavation.
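As a minimal sketch of this step, the compiled spreadsheet can be represented in R as a data frame with one row per lot; the column name `context` and the example sentences are illustrative stand-ins, not project data:

```r
# A stand-in for the compiled Excel sheet: one observation (row) per
# lot, with each lot's report text in a "context" column. In practice
# this table would be imported with read.csv() or a similar function.
input_data <- data.frame(
  context = c(
    "WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.",
    "WK90-A-4-2-124 tiene una profundidad de 30cm y 32 tiestos de ceramica."
  ),
  stringsAsFactors = FALSE
)
nrow(input_data)  # one row per excavated lot
```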
2. Define the data categories desired for collection
e.g. The primary data needed for this project are artifact
counts, excavation depths, and contextual designations.
3. Identify and collapse text patterns for each data
category
Textual patterns vary in written language. In order for the
code to correctly collect data, it is necessary to identify
the words and phrases associated with each data
category. For instance, “ceramics” might be indicated by
words like “sherds,” “pieces,” “fragments,” etc.
4. Extract quantities associated with each textual
pattern for each observation
Once text patterns that indicate a data category have
been identified, R can recognize those patterns and
extract the quantities associated with each observation.
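Steps 3 and 4 can be sketched in base R alone (no add-on packages); the pattern list here is limited to the ceramics example, and the sample sentence is illustrative:

```r
# Collapse the ceramic variations into a single alternation pattern,
# then use a lookahead (perl = TRUE) to extract the number that
# immediately precedes any of the variations.
ceramics_variations <- c("tiestos de ceramica", "tiestos")
ceramics_pattern <- paste(ceramics_variations, collapse = "|")

context <- "Se recuperaron 24 tiestos de ceramica."
count <- regmatches(
  context,
  regexpr(paste0("\\d+(?=\\s(?:", ceramics_pattern, "))"),
          context, perl = TRUE))
count  # "24"
```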
5. Compile quantities into a database for analysis
Once collected, these data are compiled into a database
for large-scale and comparative analysis.
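Step 5 amounts to row-binding the per-lot results; a sketch using placeholder rows (fictional counts, matching the poster's example format):

```r
# Each element of all_data is assumed to be the one-row data frame
# returned by the extraction step for a single lot; rbind stacks
# them into one table ready for analysis.
all_data <- list(
  data.frame(Lot = 123, Ceramics = 4),
  data.frame(Lot = 124, Ceramics = 32),
  data.frame(Lot = 126, Ceramics = 24)
)
database <- do.call(rbind, all_data)
nrow(database)  # 3
```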
RESULTS
Data Collected
The table below was built using the methods described,
“translated” from written reports into an analyzable database.
The code is structured for easy expansion or contraction
based on what data is required to answer relevant
research questions. This example uses only ceramics:
larger databases can include lithic, obsidian, and figurine
counts, or other data like Munsell designations or
elevation measurements.
Creating and Analyzing a Database
After data is collected, it is combined with other report
data and compiled into a larger database for further
analysis. This database could be a simple spreadsheet or
a more elaborate relational database. Project data can
then be efficiently analyzed and visualized, as shown
below (number of lithics per lot).
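A per-lot visualization of this kind can be produced with base R graphics; the lithic counts below are placeholders, not project results:

```r
# Placeholder counts of lithics per lot (fictional, for illustration)
lots    <- c("123", "124", "126")
lithics <- c(7, 3, 12)

# Simple bar chart: number of lithics per lot
barplot(lithics, names.arg = lots,
        xlab = "Lot", ylab = "Lithic count",
        main = "Number of lithics per lot")
```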
DISCUSSION/CONCLUSION
Once developed, implementing this program in R
reduces data collection time by up to 80% when
compared to manual entry.
High variability in textual patterns requires constant
iteration of the code and consistent data verification.
Some cleaning and manipulation of the text is
required to ensure data accuracy.
Variation across investigators, projects, sites,
excavation methods, etc. is to be expected. These
changes can be accounted for during collection and
data verification.
Future Directions:
1) After collection, data will be collated into larger
databases for large-scale comparison.
2) Future research will explore the use of relational
databases and structured query language (SQL) to
further improve data comparison across a project’s
timeline and among various projects.
3) This method can also encompass spatial data for
analysis in a spatial software or GIS.
SIGNIFICANCE/BROADER IMPACTS
Implementing this data collection method
allows investigators to address more
expansive research questions through the
creation of larger datasets in less time,
saving investigation funds and labor-hours.
Leveraging automated code is an easily
adoptable solution for managing and
analyzing extensive data collected by long-
term projects or held within archives,
enabling re-engagement with previous
research and the expansion of current
efforts.
Creating code to collect quantitative data
from text documents, and making that code
publicly available, will broaden access to
data from public sources, enabling more
equitable access for scholars and
community-led research initiatives.
Contextual Information in Text Form
WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.
WK90-A-4-2-124 tiene una profundidad de 30cm y 32 tiestos de ceramica.
WK90-A-4-1-126 tiene mucho suelo y raices. Se recuperaron 24 tiestos de ceramica.
(English: lot 123 is the humus level and contains 4 sherds; lot 124 is 30 cm deep with 32 ceramic sherds; lot 126 has much soil and roots, and 24 ceramic sherds were recovered.)
# Load stringr for pattern extraction
library(stringr)

# Define the variations for ceramics
ceramics_variations <- c("tiestos de ceramica", "tiestos")

# Collapse the variations into a single regex pattern
ceramics_pattern <- paste(ceramics_variations, collapse = "|")

# Extract the quantity that precedes any ceramics variation,
# then combine the data collected into a data category
extract_data <- function(context) {
  ceramics <- str_extract(
    context,
    paste0("(?i)\\d+(?=\\s(?:", ceramics_pattern, "))"))
  data.frame("Ceramic Count" = ceramics, check.names = FALSE)
}

# Apply the data extraction function to each row of the dataframe
all_data <- lapply(input_data$context, extract_data)
Combined Data Frame (fictional data)
Operation  Sub-op  Unit  Level  Lot  Ceramic Count
WK90       A       4     1     123   4
WK90       A       4     2     124   32
WK90       A       4     1     126   24
Using R to collect and collate data
from written reports allows for:
1) Efficient archival data collection
2) Incorporation of archival data into
new analyses
3) New analyses using large datasets
4) Equitable access to published data
ChatGPT AND CODE CREATION
Generative AI like ChatGPT makes coding accessible
to the average user through focused prompting.
This poster was presented at the 2024
Society for American Archaeology
Annual Meeting in New Orleans, LA.
It took 3rd place in the Society for
Archaeological Sciences’ Graduate
Student Poster Competition.